Biologically Inspired Hierarchical Model for Feature Extraction and Localization 
Abstract

Feature extraction and matching are among the central problems of
computer vision. It is inefficient to search for features over all
locations and scales. Neurophysiological evidence shows that, to
locate objects in an image, the human visual system directs visual
attention to a specific object while ignoring others. The brain also
has a mechanism to search from coarse to fine. In this paper, we
present a feature extractor and an associated hierarchical searching
model to simulate such processes. With the hierarchical
representation of the object, coarse scanning is done through the
matching of the larger scale and precise localization is conducted
through the matching of the smaller scale. Experimental results
demonstrate the effectiveness and efficiency of the proposed model in
localizing features.



1 Introduction

In computer vision, there are two central problems: the extraction of
robust features and the subsequent precise localization of those
features. Searching for features at every location and scale is
computationally expensive. It is common to use interest operators [1],
salient feature detectors [2], or selective visual attention [4] to
select features for learning. However, matching is still
computationally expensive, given that the number of key points is
usually over 10,000 [3].

In this paper, we take a different approach that places the emphasis
on efficient matching in a top-down attention manner. The model is
inspired by human perception. When we search for an object in an
image, keeping in mind roughly what the object looks like, we jump
from one region to another, attending to some aspects of that object
while ignoring others. From these aspects, we can finally localize the
object. To build such a model in computer vision, a hierarchical
representation of an object is given at many scales. Coarse scanning
is done through the matching of the larger scale and precise
localization is conducted through the matching of the smaller scale.

Figure 1. Two elements on 2 layers.

2 Feature Description 

For feature extraction, we use Gabor filters because they have been
shown to be good simulations of the visual cortex and they give a
sparse representation of images [?].

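2D Gabor kernels are characterized by the following equation,

$$\psi_{f_0,\theta,\sigma}(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{4(x\cos\theta + y\sin\theta)^2 + (y\cos\theta - x\sin\theta)^2}{8\sigma^2}\right) \sin\bigl(2\pi f_0 (x\cos\theta + y\sin\theta)\bigr), \tag{1}$$

where there are 3 parameters: the spatial frequency $f_0$, the scale
$\sigma$, and the orientation $\theta$. Spatial frequency $f_0$ and
scale $\sigma$ can be combined using the frequency bandwidth $\phi$
(in octaves), with
$2\pi f_0 \sigma = \sqrt{2\ln 2}\,(2^{\phi} + 1)/(2^{\phi} - 1)$ [5].
Since the frequency bandwidths of simple and complex cells have been
found to range from 0.5 to 2.5 octaves, centered around 1.2 and 1.5
octaves [6], [?], we set $\phi = 1.5$ in our system.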
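As an illustration, the kernel and the bandwidth relation can be sketched in a few lines of NumPy. This is only a sketch under stated assumptions: it takes the kernel to be a sine carrier under an elliptical Gaussian envelope, with $2\pi f_0 \sigma = \sqrt{2\ln 2}\,(2^{\phi} + 1)/(2^{\phi} - 1)$. The function names, the example frequency `f0=0.1` and the truncation radius of $4\sigma$ are hypothetical choices, not values from the paper.

```python
import numpy as np


def sigma_from_bandwidth(f0, phi=1.5):
    """Scale sigma implied by 2*pi*f0*sigma = sqrt(2 ln 2) * (2^phi + 1) / (2^phi - 1)."""
    k = np.sqrt(2.0 * np.log(2.0)) * (2.0 ** phi + 1.0) / (2.0 ** phi - 1.0)
    return k / (2.0 * np.pi * f0)


def gabor_kernel(f0, theta, phi=1.5, half_size=None):
    """Sine-phase 2D Gabor kernel sampled on an integer pixel grid."""
    sigma = sigma_from_bandwidth(f0, phi)
    if half_size is None:
        # Assumption: truncate the kernel at about 4 sigma.
        half_size = int(np.ceil(4.0 * sigma))
    coords = np.arange(-half_size, half_size + 1)
    x, y = np.meshgrid(coords, coords)
    # Rotate coordinates so that u runs along the carrier direction.
    u = x * np.cos(theta) + y * np.sin(theta)
    v = y * np.cos(theta) - x * np.sin(theta)
    envelope = np.exp(-(4.0 * u ** 2 + v ** 2) / (8.0 * sigma ** 2))
    carrier = np.sin(2.0 * np.pi * f0 * u)
    return envelope * carrier / (np.sqrt(2.0 * np.pi) * sigma)


kernel = gabor_kernel(f0=0.1, theta=0.0)
```

Because only the sine term is used, the kernel is odd along the carrier direction and therefore has no response to uniform image regions.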

Figure 2. Two steps of matching. The solid circles represent matched
elements, while the dashed circles represent the whole layer. At each
step, C.L. is calculated so that the relative position of C.M. and
C.L. in the new image is the same as that of C.M. and T.L. in the old
image, up to a scale factor. C.L.: corrected location; C.M.: center of
matched element; T.L.: training location; I.L.: initial location;
F.L.: final localized location.

Note that the Gabor kernel uses only the sine term. Phase shifts are
crudely approximated by convolving the image with the kernel at all
locations. A maximum operation is applied to make the detection
tolerant to image distortions. The pooling range is only within each
receptive field whose size is 1\/2a. It is much smaller than 
the range used in other articles so that we do not lose 
much information about the locations of features which can 
be recovered in the subsequent matching. We do not take 
max operation across scale so that we do not lose scale in- 
formation either. 

In our system, receptive fields (RFs) are arranged similarly on $M$
different layers. The ratio of the size of each RF on a larger layer
to that on its nearest smaller layer is kept at $\sqrt{2}$. We use
this arrangement to capture features at different scales, which later
also serves as an efficient basis for matching.

There are 19 feature detectors on each layer, which are called
elements. Each element is composed of 19 adjacent RFs. As an
illustration, Fig. 1 gives 2 elements on 2 layers. Each circle is an
RF. Bold circles form an element. The centers of all elements are
denoted by plus signs. Note that in this figure, as well as elsewhere
in this paper, each unit represents one pixel.

The responses of all RFs within one element are collected to form a
feature vector. In our notation, the $j$th component of a feature
vector is written as $R^{l}_{m,j}$, which indicates the response of
the $l$th element on the $m$th layer; $j$ is the RF index, which
ranges from 1 to 19.

To eliminate local contrast changes, we normalize the Gabor intensity
and map it to 16 integers. Precisely, the feature vector on the $m$th
layer is obtained as follows,

where ceil rounds a number up to the nearest integer, $G$ is the
output of the convolution with the Gabor filter
$\psi_{f_0,\theta,\sigma_m}$ within the $j$th RF, and the
normalization (and hence the max and min operations) is taken over all
RFs within the area covered by that layer. The same normalization and
mapping procedure is followed on all layers so that we can compare the
feature vectors across scales.
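The normalization and mapping step can be illustrated with a short sketch. The exact formula is not reproduced here, so the code below is one plausible reading of the description (min-max normalization over a layer, followed by ceil onto 16 integer levels) rather than the authors' definition. The function name and the choice to map the minimum response to level 1 are assumptions.

```python
import numpy as np


def quantize_responses(g, levels=16):
    """Min-max normalize pooled Gabor responses and map them to integers 1..levels."""
    g = np.asarray(g, dtype=float)
    g_min, g_max = g.min(), g.max()
    if g_max == g_min:
        # Flat input: every RF gets the lowest level (assumed convention).
        return np.ones(g.shape, dtype=int)
    normalized = (g - g_min) / (g_max - g_min)
    # ceil sends (0, 1] onto 1..levels; the minimum (0) is clipped up to 1.
    return np.clip(np.ceil(levels * normalized), 1, levels).astype(int)


print(quantize_responses([0.0, 0.5, 1.0]))  # [ 1  8 16]
```

Because the normalization uses the minimum and maximum of the layer, an affine change of contrast or brightness applied to all responses leaves the quantized values unchanged.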

3 Feature Matching and Point Localization 

In a reference image, some key points are first selected for training
using a saliency-based algorithm. For recognition, we need to find
those points in a new image that correspond to the selected points.

It is inefficient to search all possible locations and scales in the
image. Because discarding irrelevant regions at the pre-selection
stage is highly beneficial, we first use a saliency-based algorithm to
select some subregions and build our search model there.

The idea for searching is based on the multi-scale representation of
an object. Matching features at different scales yields different
resolutions: matching on the larger layers works as coarse scanning,
and matching on the smaller layers works as fine localization. Suppose
we want to localize a feature to a resolution of $r \times r$ in an
$A \times B$ image. The computational complexity of scanning the whole
image is $\frac{AB}{r^2} S$, where $S$ is the number of scales at
which the object is supposed to be recognizable. Using our system with
an $M$-layer representation, with $\sqrt{2}$ being the ratio of the
sizes of 2 adjacent layers, the computational complexity is reduced by
a factor of about $2^{M}$.

The searching is done in 2 steps.

First, the matching of the largest layer gives a coarse scanning. For
a chosen $M$-layer mask (assuming that we want to detect an object on
$M$ scales), we obtain only the feature vectors of the largest layer
by placing the $M$th layer at an initial starting point. Then we match
the feature vectors to the template in a nearest-neighbor manner. The
similarity of 2 feature vectors is defined to be,
which takes values from 0 to 1. Suppose the nearest neighbors are
found to be, for example, the $l$th element on the $M$th layer of the
new image and the $l'$th element on the $m'$th layer of the template.
Bearing in mind that the matched elements capture the same region of
the 2 images, we know the scale of the new image relative to that of
the original image from which we form our template. Also, we know that
the matched elements should have the same position relative to the
center of the mask, weighted by the scale, if the initial position is
correctly placed. If the indices of the two matched elements are
different, $l \neq l'$, the center of the mask is misplaced in the new
image. We should move the mask to
$X^{l}_{M} - (\sqrt{2})^{M-m'}\, Y^{l'}_{m'}$, where $X^{l}_{M}$ is
the coordinate of the center of element $\{l, M\}$, and
$Y^{l'}_{m'}$ is the position of element $\{l', m'\}$ relative to the
center of the mask in the template.

Figure 3. An image and its transformed image obtained by skewing,
scaling, rotation, addition of pixel noise, and change of brightness
and contrast.

Next, at the corrected location, feature vectors of the smallest layer
(if $M$ is very large, we may need a larger layer as well as the
smallest one) are obtained and matched to the corresponding layer of
the template, from which the target is localized.

The algorithm to match a feature in a new image with the template can
be listed as follows:

• Pre-select some interesting subregions in a new image with a
saliency-based algorithm.

• Choose the central point from each subregion and obtain the feature
vectors of the largest layer.

• Find the nearest neighbor of the largest layer from the template.

• Calculate the relative scale of the two images and the new initial
location.

• Localize the target from the new location with the smallest layer.
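The nearest-neighbor step in the list above can be sketched as follows. The similarity function of the paper is not available in this text, so `similarity` below is a stand-in chosen only because it also takes values from 0 to 1 for 16-level feature vectors. The array layout (a template of shape layers × elements × RFs) and all names are likewise assumptions made for illustration.

```python
import numpy as np


def similarity(a, b, levels=16):
    """Stand-in similarity in [0, 1]; NOT the definition used in the paper."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.mean(np.abs(a - b)) / (levels - 1)


def nearest_neighbor(new_layer, template):
    """Find the best-matching pair between one layer of the new image and the template.

    new_layer: array (n_elements, n_rfs), feature vectors of the largest layer.
    template:  array (n_layers, n_elements, n_rfs), feature vectors of all layers.
    Returns (score, (l, m_prime, l_prime)) using 0-based indices.
    """
    best_score, best_match = -1.0, None
    for l, vec in enumerate(new_layer):
        for m_prime, layer in enumerate(template):
            for l_prime, template_vec in enumerate(layer):
                score = similarity(vec, template_vec)
                if score > best_score:
                    best_score, best_match = score, (l, m_prime, l_prime)
    return best_score, best_match


# Toy data: plant element 7 of template layer 2 as element 4 of the new layer.
rng = np.random.default_rng(0)
template = rng.integers(1, 17, size=(3, 19, 19))
new_layer = rng.integers(1, 17, size=(19, 19))
new_layer[4] = template[2, 7]
score, match = nearest_neighbor(new_layer, template)
```

The returned layer index `m_prime` gives the relative scale of the two images, and a mismatch between `l` and `l_prime` signals that the mask center must be corrected before the fine localization step.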



As an example, we demonstrate how this works using a Chinese word.
First, the template is formed in the image centered around a Chinese
word located at the training location (T.L.) [197, 259].

The localization is finished in 2 steps, as shown in Fig. 2. In the
first step, we put the 5th layer near the target at the initial
location (I.L.) in the new image (so that our system can recognize the
same Chinese word from 2 times larger to 2 times smaller than the
original training word). We collect responses from all the receptive
fields on the 5th layer. Two matched elements (C.M.) are found to be
$l = 12$, $m = 5$, located at [137, 293] in the new image, and
$l' = 15$, $m' = 6$, at [120, 203] in the old image. The scale of the
new image is identified to be $\sqrt{2}$ times larger than the old
image.

Since $l \neq l'$, the corrected location (C.L.) is calculated to be
$[120, 203] - ([137, 293] - [197, 259])/\sqrt{2} = [162, 179]$.
The second step simply uses the 1st layer at the corrected location
calculated from step 1 and repeats the above procedure. The final
localized point is [182, 179], which corresponds to the trained point
with high accuracy.
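The arithmetic of the first step can be checked numerically. The sketch below simply mirrors the expression as it is written in the worked example. The function and argument names are hypothetical, and the code makes no claim about how the authors implemented the correction.

```python
import numpy as np


def corrected_location(cm_a, cm_b, training_location, scale):
    """Evaluate cm_a - (cm_b - training_location) / scale, as in the worked example."""
    cm_a = np.asarray(cm_a, dtype=float)
    cm_b = np.asarray(cm_b, dtype=float)
    training_location = np.asarray(training_location, dtype=float)
    return cm_a - (cm_b - training_location) / scale


cl = corrected_location([120, 203], [137, 293], [197, 259], np.sqrt(2))
print(np.round(cl).astype(int))  # [162 179]
```

Before rounding, the result is approximately [162.43, 178.96], which agrees with the corrected location [162, 179] quoted in the example.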

Searching in each subregion yields one candidate for the object's
location in the new image. In a new image that has multiple objects, a
threshold is useful to identify the objects. If it is known in advance
that there is only one match in a new image, we can simply choose the
best candidate, which proves to be effective in the map mapping
experiment (the false alarm rate for our representation is small). The
final evaluation of these candidate locations is made using the
similarity function, comparing one element or a set of elements (see
the experimental section for the difference in results) centered
around each candidate location. The evaluation function for each
located candidate is defined to be

$$\mathrm{Eval} = \prod_{i=1}^{T} \mathrm{Sim}(R_i, R'_i), \tag{4}$$

where $i$ goes from 1 to the number of elements $T$ we want to use,
and $R_i$ and $R'_i$ are the corresponding feature vectors for that
element in the new image and in the old image, respectively.

4 Experimental examples 

Fig. 3 shows an example of point matching. The first image is a
reference image and the second image is a transformed image. We
randomly choose 100 points in the first image and search for those
points in the transformed image. We divide the 800 by 800 picture into
subregions of 150 by 150 pixels. In each subregion, we localize a
candidate, and the best of these candidates is taken to be the
recovered point. If the recovered point is less than 8 pixels away
from the actual location calculated from the transformation, a match
is found. Table 1 shows the results. The first column of data gives
the matching rate when all elements ($T = 19$ in Eq. 4) on the
smallest layer are used for the evaluation, while the last column
gives the matching rate when the evaluation is done only on the
matched element. Our algorithm is weak in handling rotation-related
transforms. We did nothing in the representation to make it
rotationally invariant, although we can achieve this simply by
shifting the Gabor intensity matrices [5]. This is justified because
the human visual system is not rotationally invariant.

Table 1. For various image transformations applied to the original
image, the table gives the percentage of points matched within 8
pixels of the true target location calculated according to the
transformations.

A. Increase contrast by 1.2
96
B. Decrease intensity by 0.2
C. Rotate by 10 degrees
D. Scale by 0.7
E. Scale by 0.5
G. Skewed by 7 degrees
H. Scale by 1.5
95
88
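The matching criterion used for Table 1 can be written down compactly. The following is a minimal sketch assuming that recovered and expected locations are given as arrays of pixel coordinates. The function name and the toy coordinates are hypothetical and are not data from the experiment.

```python
import numpy as np


def matching_rate(recovered, expected, tolerance=8.0):
    """Percentage of recovered points lying less than `tolerance` pixels from the expected ones."""
    recovered = np.asarray(recovered, dtype=float)
    expected = np.asarray(expected, dtype=float)
    distances = np.linalg.norm(recovered - expected, axis=1)
    return 100.0 * np.mean(distances < tolerance)


# Toy check: two of the four points fall within 8 pixels of their targets.
recovered = [[10, 10], [50, 50], [100, 100], [0, 0]]
expected = [[12, 13], [50, 59], [100, 100], [8, 0]]
rate = matching_rate(recovered, expected)
```

The comparison is strict, so a point exactly 8 pixels away is not counted as a match, following the "less than 8 pixels" wording above.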

The second experiment is designed for recognition. In 
the trained image, 5 key points are selected on an object for 
training, which are then recovered in the recognition image. 



Figure 4. (a) 5 points selected on an object in the trained image and
(b) the 5 recovered points in the recognition image.

5 Conclusion and Discussion